Abstract

Although there has been significant progress in the past 
decade, tracking is still a very challenging computer vision 
task, due to problems such as occlusion and model drift. 
Recently, the increased popularity of depth sensors (e.g. 
Microsoft Kinect) has made it easy to obtain depth data at 
low cost. This may be a game changer for tracking, since 
depth information can be used to prevent model drift and 
handle occlusion. In this paper, we construct a benchmark 
dataset of 100 RGBD videos with high diversity, including 
deformable objects, various occlusion conditions and mov- 
ing cameras. We propose a very simple but strong base- 
line model for RGBD tracking, and present a quantitative 
comparison of several state-of-the-art tracking algorithms. 
Experimental results show that including depth information 
and reasoning about occlusion significantly improves track- 
ing performance. The datasets, evaluation details, source 
code for the baseline algorithm, and instructions for sub- 
mitting new models will be made available online after ac- 
ceptance. 

1. Introduction 

In the last decade, tracking algorithms have evolved sig- 
nificantly in both their sophistication and quality of results. 
However, tracking is still considered a very challenging task 
in computer vision, particularly because a slight mistake in 
one frame may be reinforced after an online learning step, 
resulting in the so-called model drift problem. Furthermore, 
occlusion of target objects occurs quite often in real world 
scenarios, and it is not clear how to model occlusions ro- 
bustly. The state-of-the-art methods (e.g. [16, 2, 13, 31]) 
usually employ very powerful learning and energy mini- 
mization methods in the hopes of better handling these is- 
sues. 

Fortunately, we are moving into a 3D era for digital
devices. Accurate and affordable depth sensors, such
as Microsoft Kinect, Asus Xtion and PrimeSense, make
depth acquisition easy and cheap. With an accurate depth 
map, many traditional computer vision tasks become
significantly easier (e.g. human pose estimation [ ]). For
tracking, the depth map can provide valuable additional
information that significantly improves results through much
more robust occlusion and model drift handling.

Figure 1. Examples of our RGBD tracking benchmark dataset with
manual annotation of all frames.

How much does depth information help in tracking? 
What is the baseline performance for tracking given the 
depth information? How far are we from claiming that 
we have solved the tracking problem if we have reliable 
depth accessible? What is a reasonable baseline algorithm 
for tracking with RGBD data, and how do the state-of-the- 
art RGB tracking algorithms perform compared with this 
RGBD baseline? 

This paper seeks to answer these questions by proposing
a very simple but powerful baseline algorithm and conducting
a quantitative benchmark evaluation. To build a reasonable
baseline, we use state-of-the-art HOG features [8, 12] and
sliding window detection with a linear SVM [7, 6], which
incorporate depth information to prevent model drift, together
with robust optical flow [ ], and propose a very simple model
to represent the depth distribution for occlusion handling.

Figure 2. Illustration of our baseline RGBD tracking algorithm. The 2D confidence map is the combined confidence map from the classifier
and the optical flow tracker. The 1D depth distribution is a Gaussian estimated from the target depth histogram. A 3D confidence map is computed
by applying a threshold from the 1D Gaussian to the 2D confidence map. In the output, the target location (the green bounding box) is
the position of highest confidence. The occluder is recognized from its depth value.

To evaluate the algorithms, we construct a large RGBD 
video dataset of 100 videos with high diversity, including 
deformable objects, various occlusion conditions, and mov- 
ing cameras, under different lighting conditions and in dif- 
ferent scenes (Figure 1). We aim to lay the foundation for 
further research in this task, for both RGB and RGBD track- 
ing approaches, by providing a good benchmark and base- 
line. We will withhold the ground truth annotation for a 
portion of the dataset, provide instructions for submitting 
new models, and host an online evaluation server to allow 
public submission of the results from new models. 

1.1. Related works 

There are many noteworthy tracking algorithms which 
have been proposed in the last decade. Here we briefly 
summarize only a partial list of them, due to space con- 
straints. [ ] proposes a very robust system with online mul- 
tiple instance learning. [16] designs a framework to in- 
tegrate tracking, learning, and detection using P-N loops. 
[ ] uses semi-supervised online boosting to increase tracking
robustness. [ ] learns a view-based representation to
account for object articulations, while [ ] handles it using 
a fragments-based model. To address target appearance 
changes, [ ] uses a Gaussian Mixture Model built from 
online expectation maximization (EM), and [20] presents 
an incremental subspace learning algorithm. More re- 
cently, [ ] proposes using compressive sensing for real- 
time tracking, and [ ] presents structured output predic- 
tion to avoid intermediate classification. There are also 
some important works on multiple target tracking and mo- 
tion flow estimation, such as [17, 28]. 

There have also been some seminal works on tracking using
RGBD cameras [18, 24, 19, 25], but they all focus on
tracking the human body. The publicly available RGBD 
People Dataset [24, 19] contains only one sequence with 
1132 frames captured with static cameras with only people 
moving, which is obviously not enough to evaluate tracking 
algorithms for general objects. 

There have been several great benchmarks for various
computer vision tasks that help to advance the field and
shape computer vision as a rigorous experimental science,
e.g. the two-view stereo matching benchmark [ ], multiple-view
stereo reconstruction benchmark [ ], optical flow
benchmark [ ], Markov Random Field energy optimization
benchmark [ ], object classification, detection, and seg- 
mentation benchmark [11], scene classification benchmark 
[29] and large scale image classification benchmark [ ]. 
This paper is an addition to the list to provide a benchmark 
of tracking, for both RGB and RGBD video. 

2. Baseline algorithm 

The goal is to build a simple but strong baseline algo- 
rithm leveraging state-of-the-art feature, detection and op- 
tical flow algorithms with simple but reasonable occlusion 
handling. An overview of the baseline tracking algorithm is 
shown in Figure 2. 

2.1. Detection and optical flow 

Our baseline algorithm includes a linear support vector 
machine classifier (SVM [7, 6]) based on RGBD features, 
an optical flow tracker [ ] and a target depth distribution 
model, which are initialized by the input bounding box from 
the first frame and updated online. The RGBD feature we 
used is histogram of oriented gradients (HOG[ , ]) from 
both RGB and depth data (Figure 3). HOG for depth is ob- 
tained by treating depth data as a gray scale image. This 
RGBD HOG feature describes local textures as well as 3D
shapes. In this feature space the target is more separable
from both the background and the occluder, which improves
robustness against model drift, especially when there is
illumination variation, lack of texture, or high similarity
between target and background color.

Figure 3. The features we used for our baseline algorithm: (a) HOG of RGB, (b) RGB image, (c) HOG of depth.
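
As a rough illustration of why a gradient feature computed on depth can remain informative when the color image lacks texture, the following sketch builds a single-cell orientation histogram for a grayscale patch and for a depth patch and concatenates them. It is a simplification, not the HOG of [8, 12]: there is no cell grid, block normalization, or pyramid, and the function names `orientation_histogram` and `rgbd_descriptor` are hypothetical.

```python
import math

def orientation_histogram(img, bins=9):
    """Unsigned gradient-orientation histogram of one cell.

    img is a 2D list of intensities; depth is treated as a grayscale image.
    """
    hist = [0.0] * bins
    height, width = len(img), len(img[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # central differences
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # unsigned orientation in [0, pi)
            b = min(int(angle / math.pi * bins), bins - 1)
            hist[b] += mag                        # magnitude-weighted vote
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

def rgbd_descriptor(gray, depth, bins=9):
    """Concatenate the histograms of the grayscale and depth patches."""
    return orientation_histogram(gray, bins) + orientation_histogram(depth, bins)

# A textureless color patch whose depth map contains a vertical step edge.
gray = [[5] * 5 for _ in range(5)]
depth = [[1, 1, 1, 3, 3] for _ in range(5)]
descriptor = rgbd_descriptor(gray, depth)
```

In this toy patch the color half of the descriptor is all zeros while the depth half has a single dominant bin, which is the situation in which the depth channel helps most.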

In the subsequent frame, a HOG pyramid is computed,
and a sliding window is run using a convolution with the SVM
weights, which returns several possible target locations with
their confidences. The confidences of these locations are then
adjusted according to the bounding box estimated by the optical
flow tracker [5] in the following way:



$$c = c_d + \alpha \, c_t \, r_{(t,d)} \qquad (1)$$



in which $c_d$ is the confidence of detection, $c_t$ is the
confidence of optical flow tracking, $r_{(t,d)}$ is the ratio of
overlap between the bounding boxes produced by the detector and
the optical flow tracker, i.e. an indication of their
consistency, and $\alpha$ denotes the weight of the overlap
ratio ($\alpha = 0.5$ in our experiments). After this step, the
target depth distribution model, a Gaussian distribution learned
from previous frames, discards bounding boxes far from the
estimated target depth. The most probable remaining bounding box is
picked and re-centered towards the center of the nearby region
whose depth agrees with the target depth model. Such
re-centering helps prevent drifting of the output bounding boxes.
Afterwards, the target models are updated using this bounding
box with hard negative mining, and the tracker proceeds
to the next frame.
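
The confidence adjustment and depth gating described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in our experiments: the overlap ratio is taken to be intersection over union, boxes are `(x, y, width, height)` tuples, and the gate of `k` standard deviations around the target depth is a hypothetical choice, as are the function names.

```python
def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def pick_target(detections, flow_box, c_t, mu, sigma, alpha=0.5, k=3.0):
    """Return the (box, confidence) pair with the highest adjusted confidence.

    detections: list of (box, c_d, depth) produced by the sliding-window detector.
    flow_box, c_t: bounding box and confidence of the optical flow tracker.
    mu, sigma: Gaussian target depth model learned from previous frames.
    k: assumed gate width; boxes farther than k * sigma from mu are discarded.
    """
    best = None
    for box, c_d, depth in detections:
        if abs(depth - mu) > k * sigma:
            continue  # depth disagrees with the target model
        c = c_d + alpha * c_t * overlap_ratio(box, flow_box)
        if best is None or c > best[1]:
            best = (box, c)
    return best

detections = [
    ((0, 0, 10, 10), 0.6, 2.0),   # agrees with the flow tracker
    ((5, 0, 10, 10), 0.7, 2.1),   # partial agreement
    ((0, 0, 10, 10), 0.9, 5.0),   # strong detection at the wrong depth
]
best_box, best_conf = pick_target(detections, (0, 0, 10, 10), c_t=0.8, mu=2.0, sigma=0.1)
```

In this example the third detection has the highest raw score but is rejected by the depth gate, and the first detection wins because it is fully consistent with the optical flow box.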

2.2. Occlusion handling 

In order to handle occlusions, some traditional RGB 
trackers like [16] use forward-backward error to indicate 
tracking failure caused by occlusion, and some others like 
[1, 2] use a fragment-based model to reduce the models' 
sensitivity to partial occlusion. However, with depth infor- 
mation the solution for this issue becomes more straight- 
forward. Here we propose a simple but effective occlusion 
handler which actively detects the target occlusion and re- 
covery during tracking process. 

Occlusion detection We assume that the target is the clos- 
est object that dominates the bounding box when not oc- 
cluded. A new occluder in front of the target inside the 
bounding box indicates the beginning of occlusion state. 
Figure 4. Depth distribution inside the bounding box. The top row
shows the distribution in normal state, and the bottom row shows 
the distribution when occlusion occurs. The red Gaussian denotes 
the target model, and the green denotes the occluder model. 

Therefore, depth histogram inside bounding box is expected 
to have a newly rising peak with a smaller depth value than 
target, and/or a reduction in the size of bins around the tar- 
get depth, as illustrated in Figure 4. 

The depth histogram $h_i$ of all pixels inside a bounding
box can be approximated as a Gaussian distribution for the
$i$-th frame:

$$h_i \sim \mathcal{N}(\mu_i, \sigma_i^2). \qquad (2)$$

We define the likelihood of occlusion for this frame as:

$$O_i = \frac{\sum_{d=0}^{\mu_i - \sigma_i} h_i(d)}{\sum_{d} h_i(d)}, \qquad (3)$$

where $h_i(d)$ is the count in the $d$-th bin for the $i$-th frame,
$d = 0$ is the depth of the camera, and $\mu_i - \sigma_i$ is the
threshold for a point to be considered part of an occluder. The number
of pixels that have a smaller depth value than the target depth is
considered the area of the occluder that has appeared in the
bounding box. Hence, a larger $O_i$ indicates that an occlusion
is more likely. The target depth value is updated online,
so a target moving towards the camera will not be treated as
an occlusion.
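
The occlusion likelihood can be illustrated with a short sketch. It assumes the histogram is a list indexed by depth bin (bin 0 being closest to the camera), that the Gaussian target model is fitted on an earlier, unoccluded frame, and that the likelihood is the fraction of pixels in bins closer than one standard deviation in front of the target mean. The function names are hypothetical.

```python
import math

def fit_gaussian(hist):
    """Mean and standard deviation of a depth histogram (index = depth bin)."""
    total = sum(hist)
    mu = sum(d * h for d, h in enumerate(hist)) / total
    var = sum(h * (d - mu) ** 2 for d, h in enumerate(hist)) / total
    return mu, math.sqrt(var)

def occlusion_likelihood(hist, mu, sigma):
    """Fraction of pixels whose depth bin lies at or below mu - sigma."""
    total = sum(hist)
    if total == 0:
        return 0.0
    threshold = mu - sigma
    near = sum(h for d, h in enumerate(hist) if d <= threshold)
    return near / total

# Target model learned while the target is fully visible.
unoccluded = [0, 0, 0, 0, 0, 0, 0, 0, 10, 80, 10, 0]
mu, sigma = fit_gaussian(unoccluded)

# A later frame in which a closer object covers part of the bounding box.
occluded = [0, 0, 0, 40, 0, 0, 0, 0, 5, 50, 5, 0]
o_before = occlusion_likelihood(unoccluded, mu, sigma)
o_after = occlusion_likelihood(occluded, mu, sigma)
```

Here the likelihood rises from 0.10 to 0.45 once a new peak appears in front of the target, matching the behavior illustrated in Figure 4.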

Under occlusion Our occlusion model, i.e. the occluder's
depth distribution, is initialized when entering the occlusion
state. In the following frames, the occluder's position is
updated by the optical flow tracker. A list of possible target
candidates is identified either by the RGBD detection or by
a local search around the occluder. Using the color and depth
distributions of the target and occluder, the local search
performs segmentation on the RGB and depth data separately
and combines the results. The combined segmentation
produces a list of target candidates (Figure 5), whose
validity is then judged by the SVM classifier. If there is no
candidate in the search range or all of them are invalid,
the tracker just tracks the occluder, preparing for the next
frame.

Figure 5. Local search for target candidates by segmentation: (a)
RGB image (the target location is indicated by the green bounding
box, the occluder by the blue bounding box), (b) depth
segmentation, (c) RGB segmentation, (d) final segmentation result
along with possible target candidates.

Recovery from occlusion By examining the list of possible
target candidates, the tracker declares target recovery
when at least one candidate from the list satisfies all of the
following conditions: (1) the candidate's visible area is large
enough compared to the target area before entering occlusion,
(2) the overlap between the occluder and the candidate is
small, and (3) the SVM classifier reports a high confidence.
The occlusion subroutine ends once the target is recovered
from occlusion.
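
The three recovery conditions can be expressed as a simple predicate over the candidate list. The sketch below is illustrative only: the `Candidate` fields and the threshold values `min_area_ratio`, `max_overlap` and `min_confidence` are hypothetical placeholders, since the text above does not fix specific numbers.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    visible_area: float      # area of the candidate region, in pixels
    occluder_overlap: float  # overlap ratio with the occluder box, in [0, 1]
    svm_confidence: float    # classifier score for the candidate

def target_recovered(candidates, area_before_occlusion,
                     min_area_ratio=0.5, max_overlap=0.2, min_confidence=0.0):
    """Return the first candidate meeting all three conditions, else None."""
    for cand in candidates:
        large_enough = cand.visible_area >= min_area_ratio * area_before_occlusion
        clear_of_occluder = cand.occluder_overlap <= max_overlap
        confident = cand.svm_confidence >= min_confidence
        if large_enough and clear_of_occluder and confident:
            return cand
    return None

candidates = [
    Candidate(visible_area=100, occluder_overlap=0.10, svm_confidence=0.5),  # too small
    Candidate(visible_area=900, occluder_overlap=0.60, svm_confidence=0.9),  # still covered
    Candidate(visible_area=800, occluder_overlap=0.05, svm_confidence=0.7),  # passes all three
]
recovered = target_recovered(candidates, area_before_occlusion=1000)
```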

3. RGBD Tracking Benchmark 

3.1. Dataset construction 

Several testing sets of RGB videos have been developed 
to measure the performance of different trackers. How- 
ever, these datasets do not contain depth information and 
thus are not suitable for our purpose. In order to evaluate 
the performance improvement from depth information, we 
recorded a benchmark dataset consisting of 100 video clips 
with both RGB and depth data, manually annotated to con- 
tain the ground truth. 

Hardware setup Our testing data set is captured using a
standard Microsoft Kinect. It uses a paired infrared projector
and camera to calculate depth values, so its performance
is severely impaired in outdoor environments under
direct sunlight. Kinect also requires the object to lie between
a minimum and a maximum distance from the camera in order
to obtain accurate depth values. Due to these constraints,
our videos are captured indoors, with object depth
values ranging mainly from 0.8 to 6 meters.

Annotation We manually annotate the ground truth (target
location) of the dataset by drawing a bounding box on
each frame as follows. A minimum bounding box covering
the target is initialized on the first frame. On the next
frame, if the target moves or its shape changes, the bounding
box is adjusted accordingly; otherwise, it remains
the same. One author manually annotated all frames, to ensure
high consistency. Because we manually annotate each
frame, there is no interpolation or choosing of key frames.
When occlusion occurs, the ground truth is defined as the
minimum bounding box covering only the visible portion
of the target. For example, if a person is occluded so
that only his/her left arm can be seen, then we provide the
bounding box of the left arm instead of a predicted position
of the whole human body. When the target is completely
occluded, there is no bounding box for that frame. The
same labeling criterion is also used in the PASCAL VOC
challenge. We annotate all following frames in this way.

Figure 6. Statistics of our RGBD tracking benchmark dataset.

3.2. Dataset statistics 

Since the aim of the dataset is to cover as many scenarios
as possible in real world tracking applications, the
diversity of the video clips is important. Figure 6 summarizes
the statistics of our RGBD tracking dataset, which presents
variety in the following aspects:

Target type We divide targets into three types: human,
animal and relatively rigid object. Rigid objects, such as
toys and human faces, can only translate or rotate. Animals
include dogs, rabbits and turtles, whose movement
usually consists of out-of-plane rotation and some deformation.
The number of degrees of freedom of human body motion is
very high, and body parts, such as arms and legs, are often
slim, resulting in a variety of deformations which may
increase the difficulty of tracking.

Target speed Tracking difficulty is often related to target
speed. We denote target speed using $1 - r_{(i,i+1)}$, where
$r_{(i,i+1)}$ is the ratio of overlap between target bounding
boxes in two consecutive frames when no occlusion occurs.
Target speed of a video sequence is defined by its maximum
during the sequence. Compared to the real speed of the target,
this definition of speed has a more direct influence on
tracking performance, as it takes into account differences in
frame rate. Average target speed in our video data set ranges
from 0.057 to 0.599.
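
The overlap-based speed measure can be computed directly from the annotated boxes. The sketch below assumes boxes are `(x, y, width, height)` tuples, that a fully occluded frame is stored as `None`, and that the overlap ratio is intersection over union; pairs that involve an occluded frame are skipped. The function names are hypothetical.

```python
def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def sequence_speed(boxes):
    """Maximum of 1 - overlap over consecutive, unoccluded frame pairs."""
    speeds = [1.0 - overlap_ratio(a, b)
              for a, b in zip(boxes, boxes[1:])
              if a is not None and b is not None]
    return max(speeds) if speeds else None

boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (5, 0, 10, 10), None, (50, 50, 10, 10)]
speed = sequence_speed(boxes)  # the jump across the occluded frame is ignored
```

Because the measure depends on box overlap rather than physical displacement, a large slow object and a small fast one can receive the same value, which is the property that makes it track tracking difficulty more closely than real speed.
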
Figure 7. Average error rate composed of three types evaluated on different categories of test cases.



Scene type Background clutter is also an important factor 
affecting tracking performance. In our data set, we provide
several types of scenes with different levels of background
clutter. The scenes include cafe, concourse, library,
living room, office, playground and sports field. The living 
room, for example, has a simple and mostly static back- 
ground, while the background of a cafe is complex, with 
many people passing by. 

Presence of occlusion Out of 100 videos in our dataset, 
occlusion occurs in 63 videos, in which the targets are to- 
tally occluded in 16.3 frames on average. The videos cover 
several aspects that may affect tracking performance under 
occlusion, e.g. how long the target is occluded, whether the 
target moves or changes in appearance during occlusion, 
and the similarity between the occluder and the target. 

3.3. Evaluation metric 

We used two metrics to compare the proposed baseline
algorithm with other state-of-the-art trackers. One metric is
center position error (CPE) which is the Euclidean distance 
between the centers of output target bounding boxes and the 
ground truth. The above metric shows how close the track- 
ing results are to the ground truth in each frame. However, 
the overall performance of the trackers cannot be measured 
by averaging this distance. When the trackers are misled 
by background clutter, such distances can be huge, thus the 
average distance may be dominated by only a few frames. 
Also, this distance is undefined when trackers fail to output 
a bounding box or there is no ground truth bounding box 
(target is totally occluded). 
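
A small sketch makes both limitations of CPE concrete. It assumes boxes are `(x, y, width, height)` tuples and that a missing box is represented by `None`; the function names and the error values in the example are hypothetical.

```python
import math

def box_center(box):
    """Center of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def center_position_error(output, truth):
    """Euclidean distance between box centers; None when either box is missing."""
    if output is None or truth is None:
        return None
    (ox, oy), (tx, ty) = box_center(output), box_center(truth)
    return math.hypot(ox - tx, oy - ty)

# Four accurate frames and one frame where the tracker is misled by clutter.
errors = [2.0, 2.0, 2.0, 2.0, 500.0]
mean_error = sum(errors) / len(errors)  # pulled far above the typical 2.0
```

A single bad frame raises the mean from 2 to about 102 pixels, and frames without a box contribute no value at all, which is why the overlap-based criterion is used for overall comparison.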

To evaluate the overall performance, we employ the criterion
used in the PASCAL VOC challenge [ ], the ratio
of overlap $r_i$ between the output and true bounding boxes:

$$r_i = \begin{cases}
\dfrac{\mathrm{area}(ROI_{T_i} \cap ROI_{G_i})}{\mathrm{area}(ROI_{T_i} \cup ROI_{G_i})} & \text{if both } ROI_{T_i} \text{ and } ROI_{G_i} \text{ exist} \\
1 & \text{if neither } ROI_{T_i} \text{ nor } ROI_{G_i} \text{ exists} \\
-1 & \text{otherwise}
\end{cases} \qquad (4)$$

where $ROI_{T_i}$ is the target bounding box in the $i$-th frame
and $ROI_{G_i}$ is the ground truth bounding box. By setting a
minimum overlapping area $r_t$, we can calculate the average
success rate $R$ of each tracker as follows:

$$R = \frac{1}{N} \sum_{i=1}^{N} u_i, \qquad u_i = \begin{cases} 1 & \text{if } r_i \geq r_t \\ 0 & \text{otherwise} \end{cases}$$

where $u_i$ is an indicator denoting whether the output bounding
box of the $i$-th frame is acceptable, and $N$ is the number
of frames. According to the above calculation, the minimum
overlap ratio $r_t$ makes a hard decision on whether an
output is valid or not. Since some trackers may produce outputs
that have a small overlap ratio over all frames while others
give a large overlap on some frames and fail completely
on the rest, $r_t$ must be treated as a variable to conduct a fair
comparison.
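
The overlap criterion and the resulting success rate can be sketched as follows. The sketch assumes boxes are `(x, y, width, height)` tuples, that a missing box is `None`, and that a frame counts as acceptable when its overlap score is at least the threshold (the handling of exact ties is an assumption). The function names are hypothetical.

```python
def overlap_score(output, truth):
    """Overlap ratio with the three cases of Equation (4)."""
    if output is None and truth is None:
        return 1.0    # tracker correctly reports no target
    if output is None or truth is None:
        return -1.0   # exactly one of the two boxes is missing
    ox, oy, ow, oh = output
    tx, ty, tw, th = truth
    iw = max(0.0, min(ox + ow, tx + tw) - max(ox, tx))
    ih = max(0.0, min(oy + oh, ty + th) - max(oy, ty))
    inter = iw * ih
    union = ow * oh + tw * th - inter
    return inter / union if union > 0 else 0.0

def success_rate(outputs, truths, r_t):
    """Fraction of frames whose overlap score reaches the threshold r_t."""
    scores = [overlap_score(o, g) for o, g in zip(outputs, truths)]
    return sum(1 for r in scores if r >= r_t) / len(scores)

outputs = [(0, 0, 10, 10), None, None, (5, 0, 10, 10)]
truths = [(0, 0, 10, 10), None, (0, 0, 10, 10), (0, 0, 10, 10)]
rates = {r_t: success_rate(outputs, truths, r_t) for r_t in (0.3, 0.5)}
```

Sweeping `r_t` from 0 to 1 and plotting the resulting rates yields a curve per tracker, so the comparison does not hinge on a single threshold.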

In Figure 7, we further divide tracking failures into three 
types: 

Type I: $ROI_{T_i} \neq \text{null}$ and $ROI_{G_i} \neq \text{null}$ and $r_i < r_t$
Type II: $ROI_{T_i} \neq \text{null}$ and $ROI_{G_i} = \text{null}$
Type III: $ROI_{T_i} = \text{null}$ and $ROI_{G_i} \neq \text{null}$

Type I error is the case where target is visible, but tracker's 
output is far away from the target. Type II error is where 
target is invisible but tracker outputs a bounding box. Type 
III error is where target is visible but tracker fails to give 
any output. 
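
The decomposition into the three failure types can be sketched as a per-frame classification followed by a count. The sketch assumes a missing box is represented by `None` and that the overlap ratio of each frame has already been computed; the function names are hypothetical.

```python
from collections import Counter

def failure_type(output, truth, r_i, r_t):
    """Return "I", "II", "III", or None when the frame is not a failure."""
    if output is not None and truth is not None:
        return "I" if r_i < r_t else None   # visible target, output too far off
    if output is not None and truth is None:
        return "II"                          # output produced for an invisible target
    if output is None and truth is not None:
        return "III"                         # no output for a visible target
    return None                              # both missing: correct

def error_breakdown(frames, r_t):
    """frames: iterable of (output, truth, r_i). Fraction of frames per type."""
    frames = list(frames)
    counts = Counter(failure_type(o, g, r, r_t) for o, g, r in frames)
    return {t: counts.get(t, 0) / len(frames) for t in ("I", "II", "III")}

box = (0, 0, 1, 1)
frames = [
    (box, box, 1.0),           # correct
    (box, (5, 5, 1, 1), 0.0),  # Type I
    (box, None, -1.0),         # Type II
    (None, box, -1.0),         # Type III
    (None, None, 1.0),         # correct: target fully occluded, no output
]
breakdown = error_breakdown(frames, r_t=0.5)
```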

Running time of each algorithm is not included in our 
metrics, because our main focus is on the performance of 
the tracking algorithm on RGBD data. For our current im- 
plementation of the baseline algorithm, we tried to keep the 
system as simple as possible, so the code is written in Mat- 
lab and is not optimized for speed at all. There are many po- 
tential ways to speed up the algorithm, if we wish to use it in 
real time applications. For example, instead of a naive con- 
volution for the sliding window detector, we can use [10] for 
acceleration. Instead of an optical flow [ ] running on the CPU,
we can use an optical flow running on the GPU [26]. Furthermore,
there are also many ways to maximize the efficiency
using special hardware, such as FPGAs or other customized
ASIC circuits.

Figure 8. Average success rate vs. threshold of overlap ratio ($r_t$) evaluated on different categories of test cases.

3.4. Evaluation results 

To understand how much of the performance improve- 
ment is due to the use of depth data and how much is due to 
occlusion handling, we tested four versions of our proposed 
baseline tracker, which are: 

RGB uses RGB features without occlusion handling. 

RGBD uses RGBD features without occlusion handling. 

RGBOcc uses RGB features with occlusion handling. 

RGBDOcc uses RGBD features with occlusion handling 
enabled, which is our complete baseline algorithm. 

We also compare the baseline algorithms to four state-of-the-art
RGB trackers: TLD [16], CT [31], MIL [2] and SemiB [13].
The performance measured by CPE and the corresponding
snapshots are shown in Figure 9, and the success
rates measured by overlap ratio are shown in Figure
8. The error decomposition of each tracker is shown in Figure
7. Furthermore, we define an average ranking of different
algorithms, based on a combination of several indicators, as
shown in Table 1.

We can clearly see that the proposed baseline RGBD 
tracker significantly outperforms all others, which indicates 
that the extra depth map with some occlusion reasoning pro- 
vides valuable information which helps to achieve a better 
tracking result. The proposed methods use very powerful 
but more computationally expensive classifiers (with hard 
negative mining) as well as a state-of-the-art optical flow 
algorithm, while other trackers mainly focus on real-time 
performance. Thus our RGB tracker is expected to have 
higher accuracy at the cost of longer running time. How- 
ever, the effect of using depth data can still be seen by com- 
paring the results of the tracker with depth input (RGBD) 
and without (RGB). With depth data, error is reduced by 
10.9%. After enabling the occlusion handler of the RGBD 
tracker, its error rate further decreased by 12.3%. When 
compared with the other state-of-the-art trackers, the proposed
algorithm achieves an average 42.3% reduction in error rate.
In particular, when occlusion is present, occlusion detection
and handling are critical to reducing error, as shown in
Figures 7 and 8(b).

Distinguishing three types of error helps analyze different
sources of error. For example, TLD and SemiB have a
relatively high Type III error, suggesting that their models are
sensitive to target appearance change or partial occlusion, 
while MIL, CT, and RGBD have high Type II error, result- 
ing from the lack of an active occlusion detection mecha- 
nism. However, each error type cannot be considered sep- 
arately as a direct indicator of performance. For example, 
MIL and CT use target models which are less sensitive to 
occlusion and thus have a very low Type III error at the cost 
of high Type II error when target is occluded, and possible 
high Type I error in the following frames after occlusion if 
trackers are misled by the occluder. Our proposed tracker 
robustly handles different scenarios and achieves the lowest 
overall error rate. 

3.5. Discussion 

From the evaluation results obtained in the previous section,
we observed that traditional RGB trackers produce relatively
high error in the following scenarios:

Target rotation and deformation Target rotation, espe- 
cially out-of-plane rotation, and deformation are the main 
causes of model drifting for traditional RGB trackers. Tar- 
get appearance can change significantly after rotation or 
deformation, making recognition difficult. In the video 
"stuffed bear" (Figure 9 Row 1), TLD, CT, MIL and Semi- 
B lose tracking when the stuffed bear starts to rotate out 
of plane. In "basketball player" (Figure 9 Row 2), those 
trackers gradually fail to follow the player as he moves his 
arms and legs. However, the RGBD tracker was robust in 
these situations as we used depth information to identify 
the target. The depth feature is still distinguishable when 
the similarity in RGB vanishes. 

Different types of occlusion There are several factors
which may affect the difficulty of tracking under occlusion:
the size of the target's occluded portion, target movement or
appearance variations during occlusion, similarity between
occluder and target, and background clutter.

Table 1. Evaluation results: success rate % and corresponding ranking (in parentheses) under different categorizations.

Algorithm    Avg. rank  Success rate % (rank), one column per category
RGBDOcc      1          82.0(1)  60.2(1)  81.9(1)  82.9(1)  74.3(1)  77.5(1)  77.3(1)  76.1(1)  79.5(1)  81.5(1)  75.7(1)
RGBD         2.36       66.0(2)  58.3(2)  68.2(3)  69.5(2)  62.2(2)  64.9(2)  65.8(2)  61.1(3)  73.1(2)  67.7(3)  63.7(2)
RGBOcc       2.63       65.3(3)  49.2(3)  73.4(2)  64.9(3)  62.0(3)  64.8(3)  63.1(3)  61.5(2)  66.5(3)  75.4(2)  58.1(3)
RGB          4          54.7(4)  48.4(4)  56.8(4)  55.4(4)  53.5(4)  55.5(4)  53.2(4)  50.7(4)  60.9(4)  62.0(4)  51.0(4)
TLD [16]     6.09       28.2(7)  30.7(8)  43.7(5)  32.4(7)  38.5(5)  44.8(6)  29.5(5)  34.3(5)  39.3(7)  47.5(5)  31.8(7)
CT [31]      6.27       33.0(6)  43.6(5)  33.4(8)  41.3(5)  33.5(7)  47.3(5)  27.5(8)  23.8(8)  59.2(5)  38.9(7)  35.3(5)
MIL [2]      6.54       34.3(5)  34.8(6)  34.2(7)  39.6(6)  32.9(8)  40.4(8)  29.5(5)  27.7(7)  48.1(6)  38.8(8)  34.0(6)
SemiB [13]   6.90       26.1(8)  31.7(7)  38.8(6)  27.6(8)  35.0(6)  44.8(6)  29.0(7)  31.8(6)  34.1(8)  46.3(6)  26.7(8)
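
As a worked check of the average ranking in Table 1, the value listed for TLD appears consistent with the plain mean of its eleven per-category ranks. The sketch below assumes that this is how the column is formed; the function name is hypothetical.

```python
def average_rank(ranks):
    """Mean of a tracker's per-category ranks."""
    return sum(ranks) / len(ranks)

# Per-category ranks (the values in parentheses) from the TLD row of Table 1.
tld_ranks = [7, 8, 5, 7, 5, 6, 5, 5, 7, 5, 7]
tld_average = round(average_rank(tld_ranks), 2)  # 67 / 11, rounded to two decimals
```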

When partially occluded, the target appearance is less 
similar to the pre-trained models and often cannot pass the 
threshold. In the video "human face" (Figure 9 Row 4), if
only RGB data is available, fragment-based trackers can locate
the target but sometimes mistake background clutter for
the target, because with only part of target visible, the de- 
tection confidence drops. Conservative approaches, which 
do not produce output with very low confidence, often lose 
tracking. When the target is completely occluded (video 
"sign", Figure 9 Row 3), optical flow tracking becomes un- 
informative. However, from depth data, our method is able 
to identify the occluder and raise the confidence in its neigh- 
boring 3D region, compensating for the confidence loss due 
to partial occlusion, and thus identifies the target more ac- 
curately. 

If the occlusion happens gradually, the occluder, if not 
excluded, slowly grows inside the target bounding box and 
finally dominates the bounding box (video "student with 
bag", Figure 9 Row 5). On this occasion, optical flow track- 
ers and classifiers are often misled to track or detect the oc- 
cluder. It is difficult for the trackers to make corrections af- 
terwards because their models are updated incorrectly. Our 
method detects occlusion more reliably using depth data 
to recognize the occluder. And by only examining objects 
around the occluder, we prevent outputting the occluder as 
the result and update models accordingly. 

4. Conclusions 

Thanks to the great popularity of low cost depth sensors 
in the consumer market, tracking can be made easier by 
using reliable depth data as input. Object depth data pro- 
vides information for discriminating between different ob- 
jects, which cannot be obtained from RGB data alone. But 
there are many questions about how valuable such reliable 
depth information is for handling occlusion and prevent- 
ing model drift. In this paper, we construct a benchmark 
dataset of 100 RGBD videos with high diversity, including 
deformable objects, various occlusion conditions and mov- 
ing cameras. We propose a very simple but strong baseline 



model for RGBD tracking, and present a quantitative com- 
parison of several state-of-the-art tracking algorithms. We 
have demonstrated that by incorporating depth data, track- 
ers can achieve better performance and handle occlusion 
more easily as well as more accurately. With depth data, the 
baseline RGBD tracker outperforms current state-of-the-art 
RGB trackers significantly. 

We believe that this benchmark dataset and baseline al- 
gorithm can provide a better comparison of different track- 
ing algorithms, and start a new wave of research advances in 
the field by making experimental evaluation more standard- 
ized and easily accessible. The datasets, evaluation details, 
source code of the baseline algorithm, and instructions for 
submitting new models will be made available online after 
acceptance. In future work, we would like to investigate 
stronger models for both RGB and RGBD tracking, such as 
modeling deformable objects by parts [12, 30]. 